To Teach or not to Teach? Decision Making Under Uncertainty in Ad Hoc Teams Supplemental Material

Authors

  • Peter Stone
  • Sarit Kraus
Abstract

This file contains supplementary material, consisting of a proof and the details of an algorithm, to accompany a paper that appears in the proceedings of AAMAS 2010. The main paper is available at http://www.cs.utexas.edu/~pstone/Papers/bib2html/b2hd-AAMAS2010-adhoc.html

3. PROOF OF THEOREM 3.1

Theorem 3.1. It is never optimal for the teacher to pull Arm2.

Proof. By induction on the number of rounds left, r.

Base case: r = 1. If the teacher starts by pulling Arm2, the best expected value the team can achieve is μ2 + μ1. Meanwhile, if it starts with Arm∗, the worst the team can expect is μ∗ + μ2. The latter expectation is higher since μ∗ > μ1.

Inductive step: Assume that the teacher should never pull Arm2 with r − 1 rounds left. Let π∗ be the optimal teacher action policy, mapping the states of the arms (their μi, ni, and x̄i) and the number of rounds left to the optimal action: the policy that leads to the highest long-term expected value. Consider the sequence S that begins with Arm2 and subsequently results from the teacher following π∗. To show: there exists a teacher action policy π′ starting with Arm∗ (or Arm1) that leads to a sequence T with expected value greater than that of S. That is, the initial pull of Arm2 in S does not follow π∗.

In order to define such a policy π′, we define S1(n) and S2(n) as the numbers of pulls of Arm1 and Arm2, respectively, after n total steps of S. As shorthand, we denote S(n) = (S1(n), S2(n)). Similarly, define the numbers of pulls of Arm1 and Arm2 after n steps of T (i.e. when using π′) as T(n) = (T1(n), T2(n)). Next, define the relation > such that T(n) > S(m) iff T1(n) ≥ S1(m) and T2(n) ≥ S2(m), where at least one of the inequalities is strict. That is, T(n) > S(m) if at least one of the arms has been pulled more times after n steps in T than after m steps in S, and neither arm has been pulled fewer times.

Finally, we define the concept of the teacher simulating sequence S based on the knowledge of what values would have resulted from each of the actions, starting with the teacher's pull of Arm2 at step 1. (Footnote 1: Such simulation relies on an assumption that the payoffs from an arm are queued up and will come out the same no matter when the arm is pulled: they are not a function of the times at which the arm is pulled, or the payoffs from any other arms. However, our argument still holds if the payoffs are time-dependent and/or dependent on other arms, as long as the teacher has no knowledge of the nature of this dependency.) The teacher can only simulate S as long as it has already seen the necessary values — otherwise it does not know what the state of the sample averages would be when it is the learner's turn to act. After n steps of the sequence T, let the number of steps that the teacher can simulate in the S sequence be Sim(n). Specifically, Sim(n) is the largest value m such that T(n) ≥ S(m).

By way of illustration, let the values that will be obtained from the first pulls of Arm2 be u0, u1, u2, . . . and let those that will be obtained from the first pulls of Arm1 be v0, v1, v2, . . .. Consider the following possible beginning of sequence S, where pulls of Arm∗ are marked with a∗, n is the step number, the teacher's actions are in the row marked "Teacher", and the learner's actions are in the row marked "Learner" (note that by the induction hypothesis, the teacher never pulls Arm2 after the first step).

n:        1     2     3     4     5     6     7     8     9     10    ...
Teacher:  u0          v1          a∗          a∗          v4          ...
Learner:        v0          v2          u1          v3          v5    ...

In this sequence, S(0) = (0, 0), S(1) = (0, 1), S(2) = (1, 1), S(3) = (2, 1), S(4) = S(5) = (3, 1), etc. Meanwhile, suppose that the teacher's first action in sequence T is Arm∗ and the learner's first action is Arm1, leading to v0. Then T(0) = T(1) = (0, 0) and T(2) = T(3) = (1, 0). Until a pull of Arm2 occurs in sequence T, the teacher cannot simulate any steps of S: Sim(1) = Sim(2) = Sim(3) = 0.
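To make the bookkeeping above concrete, here is a minimal sketch (not from the paper; the helper names pull_counts and simulable_steps, and the use of Python, are our own) that computes S(n), T(n), and Sim(n) for the illustrated prefixes of S and T.

```python
# Minimal sketch of the bookkeeping above (ours, not the paper's).
# S(n) and T(n) are pairs (pulls of Arm1, pulls of Arm2) after n steps;
# Sim(n) is the largest m with T(n) >= S(m) componentwise.
from typing import List, Tuple

ARM1, ARM2, ARM_STAR = "Arm1", "Arm2", "Arm*"

def pull_counts(seq: List[str], n: int) -> Tuple[int, int]:
    """Pull counts (Arm1, Arm2) after the first n steps of a sequence."""
    prefix = seq[:n]
    return prefix.count(ARM1), prefix.count(ARM2)

def simulable_steps(t_seq: List[str], s_seq: List[str], n: int) -> int:
    """Sim(n): how many steps of S the teacher can simulate after n steps of T."""
    t1, t2 = pull_counts(t_seq, n)
    m = 0
    for k in range(1, len(s_seq) + 1):
        s1, s2 = pull_counts(s_seq, k)
        if s1 <= t1 and s2 <= t2:
            m = k
        else:
            break
    return m

# The example prefix of S from the table (odd steps: teacher, even steps: learner).
S_SEQ = [ARM2, ARM1, ARM1, ARM1, ARM_STAR, ARM2, ARM_STAR, ARM1, ARM1, ARM1]
# The example prefix of T: Arm*, the learner's Arm1, Arm*, then the learner's Arm2.
T_SEQ = [ARM_STAR, ARM1, ARM_STAR, ARM2]

assert pull_counts(S_SEQ, 1) == (0, 1)          # S(1)
assert pull_counts(S_SEQ, 5) == (3, 1)          # S(5)
assert simulable_steps(T_SEQ, S_SEQ, 3) == 0    # Sim(3) = 0
assert simulable_steps(T_SEQ, S_SEQ, 4) == 2    # Sim(4) = 2, as noted next
```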
If the teacher's second action in T is Arm∗ and the learner's second action is Arm2, then in the example sequence above, Sim(4) = 2.

We are now ready to define the teacher's policy π′ for generating T. Let n be the total number of actions taken so far. Then:

1. If n = 0, T(n) > S(Sim(n)), or Sim(n) is odd, then select Arm∗;
2. Else (T(n) = S(Sim(n)) and Sim(n) is even), select the next action of S (i.e. the action π∗ would select if there were r − Sim(n)/2 rounds left).

Note that by the definition of Sim, it is always the case that T(n) ≥ S(Sim(n)). Further, note that at the beginning we are in step 1 of the strategy: T(2) = (1, 0) > (0, 0) = S(Sim(2)).

It remains to show that the sequence T resulting from using this policy π′ has an expected value greater than that of S. We prove this in two cases.

Case 1: There is a least n, call it n′, such that T(n) = S(Sim(n)) and Sim(n) is even. Until that point, the teacher keeps pulling Arm∗. We can thus show that Sim(n′) < n′ as follows. After n′ steps, there are exactly n′/2 u's and v's in the T sequence (T1(n′) + T2(n′) = n′/2). But after n′ steps, there are at least n′/2 + 1 u's and v's in the S sequence (S1(n′) + S2(n′) ≥ n′/2 + 1), because the first value is a u and all the learner's actions are u's or v's. Thus the simulation of S always lags behind T in terms of the number of steps simulated: Sim(n′) < n′.

Note that if it is ever the case that T(n) = S(Sim(n)) and Sim(n) is odd (it is the learner's turn to act in S), then the teacher will pull Arm∗ once more, after which the learner will do what it would have done in sequence S after Sim(n) steps. That will cause both T(n) and S(Sim(n)) to increment by the same amount, and Sim(n) to become even. Thus in the subsequent round, the teacher will switch to step 2 of its strategy.

Once the teacher has switched to step 2 of its strategy, it will continue using that step: sequence T will follow S exactly for its remaining 2r − n′ steps. To see that, observe that in each round, T(n) and S(n) will increment by the same amount, and Sim(n) will increment by exactly 2, thus remaining even.

Now compare the sequences T and S. Up until the point of step n′ in T and Sim(n′) in S, the only difference between the sequences is that there are n′ − Sim(n′) extra pulls of Arm∗ in T. There then follow 2r − n′ steps in the two sequences that are identical. The final n′ − Sim(n′) steps in S include at least one pull of Arm1 or Arm2 (the learner's first action). Thus the expected value of T − S (the difference between the sums of their expected values) is at least μ∗ − μ1 > 0.

Case 2: It is never the case that T(n) = S(Sim(n)) and Sim(n) is even. Then the teacher continues pulling Arm∗ throughout the T sequence (r times). First, by the same argument as above, since the teacher always pulls Arm∗, it is always the case that Sim(n) < n. Next, we argue that T2(2r) = S2(Sim(2r)). That is, after Sim(2r) steps, the next step in S is a pull of Arm2 (because x̄2 > x̄1); otherwise, S could be simulated another step further by consuming another v value from T.
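As an illustration only, the following sketch (ours, not the paper's; the argument names are hypothetical, and a full implementation would also track the sample averages and the values seen so far) restates the two cases of π′: pull Arm∗ while T is strictly ahead of, or out of phase with, the simulated prefix of S; otherwise copy the next action of S.

```python
# Illustrative restatement of the teacher policy pi' (ours, not the paper's code).
# T_n and S_sim_n are the pull-count pairs T(n) and S(Sim(n)); sim_n is Sim(n);
# next_action_of_S is the action pi* would choose in S with r - Sim(n)/2 rounds left.
def pi_prime(n, T_n, S_sim_n, sim_n, next_action_of_S):
    # T(n) > S(Sim(n)): componentwise >= with at least one strict inequality.
    strictly_ahead = (T_n[0] >= S_sim_n[0] and T_n[1] >= S_sim_n[1]
                      and T_n != S_sim_n)
    if n == 0 or strictly_ahead or sim_n % 2 == 1:
        return "Arm*"            # step 1 of the strategy: keep pulling Arm*
    return next_action_of_S      # step 2: T has caught up with S at an even Sim(n)
```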
We show this by induction on the number of steps i in the T sequence, showing that it is always the case that T2(i) = S2(Sim(i)). This equation holds at the beginning (e.g. when i = 2): T(2) = (1, 0) and S(Sim(2)) = (0, 0), so T2(2) = S2(Sim(2)) = 0. Now assume T2(i − 1) = S2(Sim(i − 1)). There are three possibilities for the next action in T. If it is a pull of Arm∗ or Arm1, then T2(i) = T2(i − 1) and Sim(i) = Sim(i − 1), which implies S2(Sim(i)) = S2(Sim(i − 1)), so the condition still holds. If it is a pull of Arm2, then T2(i) = T2(i − 1) + 1 and S2(Sim(i)) = S2(Sim(i − 1)) + 1, because the new u value can be used to continue the simulation of S by at least one step, and there are no additional u's in T to increase S2(Sim(i)) any further. Therefore T2(i) = S2(Sim(i)). Note that in general, S1(Sim(i)) could be much greater than S1(Sim(i − 1)): there could be several v values from T that are then able to be used for simulating S. But if all of the available v's from T are used, we get that T(i) = S(Sim(i)), which violates the Case 2 assumption and puts us into Case 1 above (or will put us there one round later if Sim(i) is odd).
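The inductive invariant can be sanity-checked on the example prefixes used earlier. The snippet below (ours) reuses pull_counts, simulable_steps, S_SEQ, and T_SEQ from the first sketch; in that prefix of T the teacher only ever pulls Arm∗, as Case 2 requires.

```python
# Quick check (ours) of the invariant T2(i) = S2(Sim(i)) on the example prefix of T,
# reusing the helpers and example sequences defined in the first sketch above.
for i in range(1, len(T_SEQ) + 1):
    _, t2 = pull_counts(T_SEQ, i)
    _, s2 = pull_counts(S_SEQ, simulable_steps(T_SEQ, S_SEQ, i))
    assert t2 == s2, f"invariant fails at step {i}"
```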
